Abstract

Measuring and linking competencies require special instruments, special data collection designs, 
and special statistical models. The measurement instruments are tests or test forms, which can 
be used in the following situations: The same test can be given repeatedly; two or more parallel 
test forms (i.e., forms intended to be similar in difficulty and content and that still need to be 
equated) can be given at different time points; or two or more test forms that may be less parallel 
to various degrees can be given at different points in time. In some circumstances, the goal of the 
analysis is to make the scores of parallel test forms comparable across different time points and 
different samples of relatively similar ability (horizontal equating). In other situations, we aim at 
the comparability of scores of test forms that are not parallel, and although they are intended to 
measure the same competencies, they are of different difficulties and are taken by samples that 
show large differences in ability (vertical linking). In other cases, we want to evaluate the change 
in competencies over time (with or without covariates) for the same individuals measured by the 
same instrument or by different instruments (longitudinal linking). This paper will briefly discuss 
a variety of techniques for relating scale scores from different data collection points and will 
then discuss models for measuring growth. Each of these areas is a large field in itself, and each 
has a potentially strong impact on educational policies such as the No Child Left Behind Act (for 
example, the vertical linking of state or national assessments and longitudinal studies are a 
potential basis for informative analyses for the policy makers), on the life of students and parents 
(equating of achievement tests), or on the life of professionals (equating of licensure tests). In 
these times when more and more standardized testing is used for assessing competencies in 
different domains nationally and internationally, we are also discovering more challenges in 
ensuring that the process and the results are fair and accurate. In turn, these challenges and these 
new social implications open the door toward more research in support of fair assessments, both 
in improving upon the test construction process and in advancing the statistical methods 
involved. 

Key words: Comparability of scores, vertical linking, horizontal linking, longitudinal linking 

1. Introduction 


All linking processes have basically three components: a measurement instrument, a data 
collection design, and a statistical method for achieving the desired form of comparability of two 
or more test forms. In this paper we will differentiate between horizontal, vertical, and 
longitudinal linking based on the first two components; the third component will be addressed by 
describing general tools used within any of the linking processes. The term linking is used with 
slightly different meanings in the field of educational assessments: (a) as a general term for 
denoting a relationship between test forms (at the total score level, at the parameter level, etc.); 
(b) as a weaker form of equating; and (c) as a synonym for item response theory (IRT) item 
parameter calibration. In this paper, the term linking is mostly used as in (a) and (c) above. 

Also, in this paper we take the view that modelling change or growth over time with or 
without covariates relies on an implicit assumption that the competency measured is scaled 
across the measurement points, and therefore, we present the models for measuring growth as a 
higher level built on top of the linking stage. 

This paper focuses on two types of research questions that arise in educational 
measurement: (a) how to compare the competency measured by two test forms given to different 
groups of examinees, and (b) how to measure change in the competency measured by the same 
test form or two or more test forms given to the same group or different groups of examinees at 
different points in time. 

In order to answer each of these two questions, the outcome variable (i.e., the 
competency measured) must be comparable across time points (i.e., each scale point of the 
measure must retain an identical meaning over time). The outcome variable must also remain 
construct-valid for at least the entire period of observation and maybe beyond, if prediction of 
future success is part of the study. These two requirements speak to the necessity of establishing 
a common scale for the measurement instruments/tests we use, or, in other words, establishing a 
link between the test forms (see Harris, Hendrickson, Tong, Shin, & Shyu, 2004; Patz, Yao, 
Chia, Lewis, & Hoskens, 2003). This is true in educational contexts even if we use the same test 
form at different points in time, if the construct itself might evolve over time. 

When do questions like these arise? For example, in achievement and licensure tests the 
meaning of the reporting scale should be preserved across administrations and the fairness for the 
examinees that take different test forms should be ensured. In such a process, also called a 
horizontal equating process, two test forms X and Y that are built to be parallel are placed on the 
same scale. By parallel test forms, we mean here test forms that were constructed following the 
same content and statistical specifications, that were intended to be similar in difficulty, and that 
were built with the intention of measuring the same construct. However, since the construction of 
parallel test forms is never perfect, test equating is used to make the scores on the two test forms 
interchangeable. This form of linking of two parallel test forms is called equating and is a 
preliminary step in answering the first research question above. In the (horizontal) equating 
designs, even if the groups of examinees who take the parallel forms differ in ability, these 
differences are usually small. Equating is the most stringent type of horizontal linking, and it 
refers specifically to linking/equating scores. Other types of (weaker) horizontal linkings are 
concordances (where tests that are not built to be parallel are linked on relatively similar 
populations of test takers—see Holland, Dorans, & Petersen, in press). 

In other circumstances, we need to make scores of test forms—that measure the same 
domain or construct but differ in difficulty—comparable across years of study, to enable 
measurement of growth in a particular domain. This particular type of linking is often called 
vertical linking in the field of educational measurement. In a vertical linking design, the 
examinees that take the two (or more) test forms are from representative samples of their cohorts, 
the examinees are assessed at the same point(s) in time, the samples of examinees that take 
different test forms differ significantly in ability, and the forms to be linked are not parallel 
(Harris et al., 2004; Harris & Hoover, 1987; Kolen & Brennan, 2004; Yen & Burket, 1997). 
Vertical linking is a preliminary requirement for answering the second research question. Many 
elementary and secondary test batteries report scores on a vertical scale, such as Iowa Test of 
Basic Skills (ITBS; Hoover, Dunbar, & Frisbie, 2001, 2003) or ACT’s Educational Planning and 
Assessment System (EPAS; ACT, Inc., 2000). Vertical linking is a weaker type of linking than 
equating. 

In addition to the two practical circumstances described above, there are assessments that 
focus on individual development over time under a particular type of treatment or educational 
exposure. Such situations arise in formative assessments as well as in summative assessments. In 
a longitudinal linking process, the measurement instrument might be the same, might be parallel 
test forms, or might be forms that are less parallel across time points but are taken by the same 
individuals at different points in time. 

Longitudinal linking in a sense is related to vertical linking but it requires a more 
restrictive design. In the longitudinal (panel) design, the same subjects are asked to respond to the 
same or different measurement instruments/tests at different points in time. This approach 
enables measurement of growth for the individual examinees as a function of time and (linked) 
test score or as a function of other variables of interest. The difference between longitudinal 
linking and vertical linking resides in the difference between collecting data from a longitudinal 
design and collecting data from a cross-sectional design. The instrument of measurement might 
or might not be the same in both designs. The linking that underlies longitudinal studies in the 
educational context is mostly carried out in research mode, given the costs associated with 
testing the same subjects over time. However, there are valuable longitudinal studies in the 
educational field, such as the Early Childhood Longitudinal Study (ECLS) described in Rock, 
Pollack, and Weiss (2004). The ECLS attempts to identify different patterns of cognitive growth 
in kindergarten and first grade associated with selected subpopulations based upon a nationally 
representative longitudinal sample. Of special interest in the Rock et al. study is the estimation of 
the growth rates of subpopulations of children that are considered to be at risk educationally 
and/or economically. The authors use vertically equated measurement instruments in their 
longitudinal analysis. Also, it is worth noting the increase in supply and demand for formative 
classroom assessments (Stanford Learning First from Harcourt Assessment, Inc., 2006; System 
5™ from ETS, 2006a), which emulate a longitudinal design, in the sense that the same 
examinees are tested in a classroom environment over time. The items used for these test forms 
sometimes come from precalibrated item banks and therefore, in these cases, no linking takes 
place after the tests have been administered. Even so, we envision that these valuable 
longitudinal types of data will lend themselves well to studies for measuring growth. A caveat is 
that these data are subject to additional statistical challenges such as estimating models with small 
sample sizes. 

It is common to use the terminology of vertical linking in the field of education, 
specifically when the interest is in measuring growth in a particular domain that is taught over 
several years in primary education (therefore, vertical). In this context, two or more measurement 
instruments (i.e., test forms) are used; the test forms differ since they cover the domain at 
different time points. In some of these circumstances, the vertical scale is used to express the 
growth in a particular domain as a function of the scaled scores of the subjects and the time 
points. It is worth noting that it is only recently that schools and school districts have started to 
consider the use of vertical scales for measuring growth in performance, although some vertical 
scaling procedures have been used with elementary achievement test batteries, such as ITBS 
(Hoover et al., 2001, 2003). 

In contrast, longitudinal linking, a terminology used in many other fields (medicine, 
biology, economics, etc.), often uses the same measurement instrument over time or, if not, it 
assumes that a common scale has been created and attempts to explain the intra- and 
interindividual differences by means of complex explanatory models, among which the best 
known are: (explanatory) IRT models, hierarchical (linear) models (HLM), and structural 
equation models (SEM). 

In this paper we give an overview of the existing methodologies for establishing a 
common scale for the three types of settings—horizontal, vertical, and longitudinal—with the 
scores reported at the individual level. This paper also addresses briefly the linking or trend 
analyses conducted in large-scale assessments that use a matrix design for collecting data and 
report results at the group level; we will describe this data collection design and the calibration 
methods usually employed in these assessments because the matrix design is also used for other 
linking purposes. 

The first part of the paper focuses on equating and vertical linking and the second part 
focuses on measuring growth. This paper provides a theoretical and descriptive overview and 
does not focus on practical examples. For details and examples of vertical linking, the reader 
should consult Kolen and Brennan (2004). 

In the next section, we will describe the data collection designs that are usually employed 
for linking. Then we will give an overview of the statistical methods used to achieve score 
comparability. 


2. Data Collection Designs 

The role of the data collection design is slightly different for observed-score linking as 
compared to IRT linking. In the framework of observed-score linking, the role of the data 
collection design is to disentangle the differences in the test forms from the differences in the 
abilities/competencies of the examinees that take those test forms. In consequence, there are two 
major classes of data collection designs: those where the examinees come from the same 
population (equivalent groups [EG] design, single group [SG] design) and those where the 
examinees are drawn from different populations and a set of items, called an anchor test, is taken 
by the test takers from these populations (non-equivalent groups with anchor test [NEAT] 
design). In an IRT framework, which is built around the claim that there is a separation of the 
influence of items and persons on the item responses, 1 the main purpose of the data collection 
design is to place the item parameters from test forms on the same scale (which implicitly adjusts 
for differences in ability). In the IRT context, the role of the data collection design is more 
complicated because it also has to address the intrinsic indeterminacy of the IRT models and 
therefore, the need for scale alignment, in addition to the differences in the test forms. 

In consequence, in the NEAT design and for observed-score equating, the anchor set is used 
to adjust for the possible differences in ability of the groups of examinees. In the IRT context, the 
items from the anchor test are used for aligning the scales, assuming that the IRT model fits the data 
from the two populations well (and of course, implicitly adjusting for the differences in ability, since 
the joint origin of the aligned scales is defined by the anchor test items). 
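
To make the indeterminacy concrete, consider as an illustrative assumption a two-parameter logistic item with discrimination $a_i$ and difficulty $b_i$ (the argument does not depend on this particular choice of model). For any constants $A > 0$ and $B$, rescaling the ability and the item parameters together leaves every response probability unchanged, which is why the origin and unit of the scale have to be fixed by the design, for example through the anchor items:

```latex
\begin{align*}
P(X_{ni} = 1 \mid \theta_n) &= \frac{1}{1 + \exp\{-a_i(\theta_n - b_i)\}}, \\
\theta_n^{*} = A\theta_n + B,\quad b_i^{*} = A b_i + B,\quad a_i^{*} = \frac{a_i}{A}
&\;\Longrightarrow\;
a_i^{*}(\theta_n^{*} - b_i^{*}) = \frac{a_i}{A}\,A(\theta_n - b_i) = a_i(\theta_n - b_i).
\end{align*}
```

Two separate calibrations can therefore differ by such a linear transformation even when the model fits both data sets equally well.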

These three data collection designs (EG, SG, NEAT) are the basis for building more 
complex designs as will be shown in this section. How these three data collection designs are 
used for horizontal linking and equating will be described in detail in the following section. 

Then, the data collection designs for vertical linking and the matrix design for large-scale 
educational surveys and other assessments will be described. Longitudinal studies can use 
any of or a combination of these designs, and the restriction that the same individuals are 
tested over time represents a part of the design that establishes a link across data collection 
points. 

2.1 Equivalent Groups (EG) and Single Group (SG) Designs 

Assume there are two test forms, X and Y, for which scores are to be linked or equated on 
a target population of examinees. In the EG design, there is only one population of test takers, 
and two samples are (randomly) drawn from it. Each sample of test takers takes one test form, X 
or Y. This design requires large samples for stable equating or linking results. In the SG design, 
one sample randomly drawn from a target population of examinees takes both test forms to be 
equated. The sample size required for an accurate equating in an SG design is much smaller, 
since the correlation between the two tests is taken into account in computing the standard errors 
of equating. However, the underlying assumption in an SG design is that there are no order 
effects associated with the test taken second. The data structure for the EG and SG designs is 
illustrated in Tables 1 and 2 (see also von Davier, Holland, & Thayer, 2004). These data 
collection designs are used in horizontal linking and in equating. 
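
The sample-size advantage of the SG design can be illustrated with the simplest possible case, the standard error of the difference between the two form means. The sketch below is only an illustration under stated assumptions: the function names and the numerical values are hypothetical, and the standard errors of actual equating functions involve more terms than this mean difference.

```python
import math


def se_mean_diff_eg(sd_x, sd_y, n_x, n_y):
    """SE of mean(X) - mean(Y) for two independent random samples (EG design)."""
    return math.sqrt(sd_x**2 / n_x + sd_y**2 / n_y)


def se_mean_diff_sg(sd_x, sd_y, rho, n):
    """SE of mean(X) - mean(Y) when the same n examinees take both forms (SG design).

    rho is the correlation between the scores on the two forms.
    """
    return math.sqrt((sd_x**2 + sd_y**2 - 2.0 * rho * sd_x * sd_y) / n)


# Hypothetical values, chosen only for illustration.
se_eg = se_mean_diff_eg(sd_x=10.0, sd_y=10.0, n_x=1000, n_y=1000)
se_sg = se_mean_diff_sg(sd_x=10.0, sd_y=10.0, rho=0.8, n=1000)
print(f"EG: {se_eg:.3f}  SG: {se_sg:.3f}")
```

With a zero correlation the two expressions coincide; the higher the correlation between the forms, the smaller the SG standard error for a given sample size.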

These two designs are also the basis for the other types of linking: vertical and 
longitudinal linking. Usually, these two designs are combined in various ways as will be 
described next (see also von Davier et al., 2004). 

Table 1 

Data Collection in an Equivalent Groups (EG) Design 

Population   Sample   X   Y
P            1        ✓
P            2            ✓

Table 2 

Data Collection in a Single Group (SG) Design 

Population   Sample   X   Y
P            1        ✓   ✓

2.2 Non-Equivalent Groups With Anchor Tests (NEAT) 

In the NEAT design (see, for example, von Davier et al., 2004; Kolen & Brennan, 2004), 
there are two populations, P and Q, of test takers, and a sample of examinees from each. The 
sample from P takes test X, the sample from Q takes test Y, and both samples take a set of 
common items, the anchor test, V. In observed-score equating and horizontal linking, the 
common set of items is used to adjust for the differences in ability between the two non-equivalent 
groups; in an IRT context, the common set of items is used for scale-linking purposes, 
which is a way of accounting for differences between populations through constraints on the item 
parameters in the anchor test. The NEAT design is often used when only one test form can be 
administered at one test administration because of test security or other practical concerns. The 
two populations may not be equivalent (i.e., the two samples are not from a common 
population). The NEAT designs can have internal (items in V are also part of both X and Y) or 
external (items in V are neither in X nor in Y) anchor tests. The data structure for a NEAT design 
is described in Table 3. Usually, when the data are collected following a NEAT design for 
horizontal linking or equating purposes, the differences in ability between P and Q are relatively 
small. 

From Tables 2 and 3, we see that the NEAT design contains two independent SG designs, 
one on population P and one on population Q. 
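
The missing-by-design structure of the NEAT design can be sketched as a small data structure. This is a minimal illustration, assuming an external anchor and hypothetical item labels (three unique items per form, two anchor items); none of these names come from an operational data set.

```python
# Hypothetical item labels: three unique items per form plus two anchor items.
ITEMS_X = ["x1", "x2", "x3"]
ITEMS_Y = ["y1", "y2", "y3"]
ITEMS_V = ["v1", "v2"]
ALL_ITEMS = ITEMS_X + ITEMS_Y + ITEMS_V


def neat_record(population, responses):
    """Return a full-length response record; items not administered stay None."""
    administered = (ITEMS_X if population == "P" else ITEMS_Y) + ITEMS_V
    if set(responses) != set(administered):
        raise ValueError("responses must cover exactly the administered items")
    return [responses.get(item) for item in ALL_ITEMS]


def observed_items(record):
    """Labels of the items for which this record holds an observed response."""
    return [item for item, r in zip(ALL_ITEMS, record) if r is not None]


p_examinee = neat_record("P", {"x1": 1, "x2": 0, "x3": 1, "v1": 1, "v2": 0})
q_examinee = neat_record("Q", {"y1": 0, "y2": 1, "y3": 1, "v1": 0, "v2": 1})

# Only the anchor items are observed in both populations.
common = set(observed_items(p_examinee)) & set(observed_items(q_examinee))
print(sorted(common))
```

Within each population the record is complete for one form and the anchor, which is the SG structure; across populations only the anchor columns overlap.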

Table 3 

Data Collection in a Non-Equivalent Groups With Anchor Test (NEAT) Design 

Population   Sample   X   Y   V
P            1        ✓       ✓
Q            2            ✓   ✓

2.3 Data Collection Designs Used in Vertical Linking 

In vertical linking, the designs used for data collection are combinations of the three 
designs described above. Usually, there are three categories of such combination designs: 
non-equivalent groups that have levels with (a) overlapping content, (b) no overlapping content, or 
(c) common items and a scaling test (Patz, 2005). 

In order to better adjust for the differences in competency at the extreme levels (in an 
educational context that is often a lower grade, such as grade three or four, and the highest 
grade), a combination of the designs is used in some cases, as described in Tables 4, 5, and 6. 

In Table 4, we describe a version of the NEAT design used in vertical linking. The 
vertical linking process will use the common sets of items that overlap on two levels (Y, Z, W) to 
place scores on the test forms X, Y, Z, W, S (which are constructed to measure the same construct 
but are of different difficulties) on the same scale; in this way, each test taker can be placed on 
the common scale. By doing this, the study of growth in a particular domain for each individual 
becomes possible (under appropriate assumptions and when the whole process is carried out 
carefully). 

Table 5 shows a case where a group of examinees from two adjacent levels takes two test 
forms and a group from the next two adjacent levels takes two test forms, one of them being the 
same as in the previous adjacent levels. In both cases described in Tables 4 and 5, the groups of 
examinees, P1 to P4, who take different test forms, differ in ability. 

The design from Table 5 is enhanced by randomly assigning the test takers from each 
grade to take the common set of items that will be used to link back to the previous level or to 
one that will be used to link forward to the next level, in which case a combination of the designs 
described in Tables 1, 2, and 3 is used. 

Table 4 

Example of Data Collection Design With Overlapping Content Used for Vertical Linking (or 
Common Items) 

Population   Sample   X   Y   Z   W   S
P1           1        ✓   ✓
P2           1            ✓   ✓
P3           1                ✓   ✓
P4           1                    ✓   ✓

Table 5 

Example of Data Collection Design With No Overlapping Content Used for Vertical Linking 

Population   Sample   X   Y   Z   W   S
P1 & P2      1        ✓   ✓
P2 & P3      1            ✓   ✓
P3 & P4      1                ✓   ✓
P4 & P5      1                    ✓   ✓

In Table 6, we describe another version of the NEAT design used in the vertical linking 
process (this design is usually called a scaling test—see Kolen & Brennan, 2004; Patz, 2005). 
Here there is a set of common items for each pair of adjacent levels; these sets adjust for the differences 
in the populations from adjacent levels and place the test forms X, Y, Z, and W (which again 
are constructed to measure the same construct but are of different difficulties) on the same scale. 
The scaling test V contains items that are appropriate to all levels. Obviously, there are numerous 
challenges associated with its construction because of the required appropriateness of the 
construct and difficulty across all levels. 

Table 6 

Example of Data Collection Design With a Scaling Test Used for Vertical Linking 

Population   Sample   X   V1   Y   V2   Z   V3   W   V
P1           1        ✓   ✓                          ✓
P2           2            ✓    ✓   ✓                 ✓
P3           1                     ✓    ✓   ✓        ✓
P4           2                              ✓    ✓   ✓


Vertical linking has its challenges at different stages of the process: From Table 6, we 
can see that constructing a set of common items that covers several adjacent levels is a challenge, 
and such a set might adjust less appropriately for the differences in competency at the extreme levels. 
There are other challenges for vertical linking that are not addressed here, such as how to 
construct the tests, the anchors, and the scaling test; how to address the possible shifts in 
construct; and how to introduce new test forms. 

2.4 Balanced Incomplete Block (BIB) Designs 

In large-scale educational survey assessments as well as for some state assessments, 
balanced incomplete block (BIB) or matrix designs are used. BIB designs assume that the 
different test forms are assigned to the same population (not different populations as in the 
NEAT design) and that every student only takes a subset of the items, called blocks of items, that 
are combined in intrinsic ways. Such data collection designs are used in all major international 
survey assessments such as TIMSS, PISA, and PIRLS and are based on several assumptions. The 
data collected from these designs are analyzed using extensions of IRT methods that were first 
introduced by ETS in projects for the National Assessment of Educational Progress (NAEP; 
Allen, Jenkins, & Schoeps, 2004). We will discuss these procedures in detail later. Because not 
every test taker takes all the items, these procedures are mostly used for large-scale survey 
assessments, where the reported results are at the group level (as opposed to individual level, as 
in the cases discussed above). 

However, there are assessments where individual scores are reported that also use a 
matrix design, as is the case of Massachusetts Comprehensive Assessment System (MCAS), 
which is a state assessment (Massachusetts Department of Education, 2001): 

The matrix-sampled items are used to equate test forms across MCAS administrations 
and to field-test items for possible future use as common items. In School and District 
Reports, the Subject Area Subscore pages include data generated from responses to both 
common and matrix-sampled items: this is the only instance in which matrix-sampled 
items are used to generate MCAS results. All other school and district results are 
generated from student responses to common items only. (p. 9) 

Table 7 shows a very small example of a balanced incomplete block design using five 
blocks, B1 to B5, in three positions combined into five booklets. 

Table 7 

Simple Example of a Balanced Incomplete Block (BIB) Design 

             Position
Booklet      1     2     3
1            B1    B4    B3
2            B4    B2    B5
3            B3    B5    B2
4            B2    B3    B1
5            B5    B1    B4

In the balanced incomplete block design, all blocks are present in all three booklet 
positions, although all blocks are not combined with all other blocks, and not every test taker 
sees all the items. Usually, the number of booklets (and blocks) is much larger than in the 
example above, and blocks may contain items from multiple subdomains, such as reading, 
mathematics, and science. In such cases, the number of possible combinations is very large, and 
the number of booklets is comparatively small, so that the incomplete pairing in the BIB design 
becomes an issue. 
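
The two properties discussed in this paragraph, position balance and the pairing of blocks within booklets, are easy to check mechanically for any proposed booklet layout. The following sketch assumes a hypothetical five-booklet, three-position layout in the style of the small example; the function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical layout: each row is one booklet, each column one booklet position.
BOOKLETS = [
    ["B1", "B4", "B3"],
    ["B4", "B2", "B5"],
    ["B3", "B5", "B2"],
    ["B2", "B3", "B1"],
    ["B5", "B1", "B4"],
]


def position_balanced(booklets):
    """True if every block occurs exactly once in every booklet position."""
    blocks = {block for booklet in booklets for block in booklet}
    n_positions = len(booklets[0])
    return all(
        Counter(booklet[pos] for booklet in booklets) == Counter(blocks)
        for pos in range(n_positions)
    )


def pair_counts(booklets):
    """Count how often each unordered pair of blocks shares a booklet."""
    counts = Counter()
    for booklet in booklets:
        counts.update(frozenset(pair) for pair in combinations(sorted(booklet), 2))
    return counts


def unpaired_blocks(booklets):
    """Pairs of blocks that never appear together in any booklet."""
    blocks = sorted({block for booklet in booklets for block in booklet})
    seen = pair_counts(booklets)
    return [pair for pair in combinations(blocks, 2) if frozenset(pair) not in seen]


print(position_balanced(BOOKLETS))
print(unpaired_blocks(BOOKLETS))
```

In larger designs, any pair returned by `unpaired_blocks` identifies two blocks for which no examinee provides joint data, which is the incomplete pairing referred to above.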

In cases with multiple domains, there may be booklets that only contain mathematics 
items (say M1 to M5), only science items (say S1 to S5), or only reading items (R1 to R5). This 
precludes the estimation of covariances between the three domains, but enables an estimation 
that is more accurate within domain. These focused booklets may be paired with booklets that 
contain at least two different domains, say reading and mathematics or science and mathematics, 
so that covariances between these subject matters can be estimated. This obviously means 
increasing the number of different booklets. 

The logistics of administering booklets constructed using the BIB design are roughly as 
follows: The set of booklets is spiralled throughout the sample, so that classrooms tested within 
the assessment will receive a (pseudo-)random selection of the booklets (a similar approach as in 
an EG design). In this way, when the set of booklets is given out to the examinees during test 
administration, clustering of booklets will be minimized and approximately the same number of 
booklets will be given to approximately equivalent subsamples. 
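
A minimal sketch of such a spiralled assignment is given below. It assumes a simple deterministic rotation through the booklet numbers; the function name, the `start` offset (a hypothetical way of continuing the rotation from one classroom to the next), and the student labels are illustrative, and operational spiralling procedures may differ in detail.

```python
from collections import Counter
from itertools import cycle, islice


def spiral_booklets(student_ids, booklet_ids, start=0):
    """Assign booklets by cycling through booklet_ids in a fixed order.

    start is the position in the rotation at which this group begins.
    """
    rotation = islice(cycle(booklet_ids), start, None)
    return {student: next(rotation) for student in student_ids}


# Hypothetical classroom of 12 students and a set of five booklets.
classroom = [f"student_{i:02d}" for i in range(1, 13)]
assignment = spiral_booklets(classroom, [1, 2, 3, 4, 5])
counts = Counter(assignment.values())
print(counts)
```

Because neighbouring students receive different booklets and the booklet counts differ by at most one, the subsamples taking each booklet are approximately equal in size and not clustered within classrooms.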

Either IRT or traditional equating methods can be used for most of the equating designs, 
including the NEAT design. However, if the data are collected following a BIB design, then the 
large missing-by-design data feature can be addressed only by the IRT methods. Similarly, in 
vertical linking settings, where the differences between the abilities of the groups of test takers 
are large, the IRT methodologies are more appropriate for linking purposes than the observed-score 
methods, even though there are circumstances where observed-score equating methods 
have been used. 

In this section we have described the data collection designs used in different types of 
linking processes. In the next section, we will provide a description of the statistical tools used to 
conduct the actual linking once the data have been collected. 

3. Horizontal and Vertical Linking and Scaling 

The role of the data collection designs is to provide suitable raw data for statistical 
methods aimed at producing scale linkages across time points, populations, or multiple test 
forms. In this section, we provide a succinct description of the statistical methods usually 
employed for equating and linking. 

The statistical methods used to achieve score comparability may be classified as IRT 
methods (Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980; and many others), as classical 
or traditional equating methods that are based on observed scores (Kolen & Brennan, 2004) or as 
methods based on the classical test theory, such as the true-score Levine linear equating method 
(Kolen & Brennan). 

If IRT is used in the equating or linking of the test scores, it is necessary to first use some 
sort of linking procedure or calibration method to place the IRT parameter estimates on a 
common scale (see von Davier & von Davier, 2004). Once this is accomplished, then additional 
methods, such as IRT true or observed-score equating (Kolen & Brennan, 2004) can be 
undertaken. Then a third step is employed, which refers to placing the raw scores onto some 
reporting scale. However, in many settings, mainly in vertical linking and in large-scale 
assessments, after the calibration is done, the ability or person parameter estimates are directly 
placed onto the reporting scale (Yen, 1984), and the second step is skipped altogether. 

In this section, we will first focus on horizontal equating and linking. We will describe 
the IRT linking/calibration procedures (i.e., procedures for placing the item parameters on the 
same scale) used for data collection designs that involve common items (for the EG and SG 
designs, the linking of item parameters is achieved by calibrating jointly the data from the two 
test forms). In addition, the IRT true score equating will be briefly described. Then, traditional 
observed-score equating methods will be succinctly presented. 

In the second part of this section, the scaling methodology used for vertical linking will 
be described in addition to IRT-based methods used to calibrate and link in surveys that use BIB 
or matrix designs. 

3.1 IRT Calibration Methods for the NEAT Design 

In this section, we describe the IRT calibration or IRT linking methods used in the NEAT 
design. More exactly, we discuss the usual IRT-linking methods: mean-sigma and mean-mean, 
concurrent calibration, fixed parameters calibration, the Stocking and Lord characteristic curves 
approach, and the Haebara characteristic curves approach (see Kolen & Brennan, 2004, chapter 
6, for a detailed description of these methods). The item calibration procedures for the matrix 
design will be discussed in section 3.2. 
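
As a concrete illustration of the two simplest methods in this list, the sketch below computes the slope A and intercept B of the linear scale transformation from anchor-item parameter estimates obtained in two separate calibrations, following the way the mean-sigma and mean-mean methods are commonly defined. Here a denotes item discriminations and b item difficulties; the function names and the noise-free parameter values are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, pstdev


def mean_sigma(b_from, b_to):
    """Slope A and intercept B from anchor-item difficulties (mean-sigma)."""
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def mean_mean(a_from, a_to, b_from, b_to):
    """Slope from mean discriminations, intercept from mean difficulties (mean-mean)."""
    A = mean(a_from) / mean(a_to)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def rescale_item(a, b, A, B):
    """Place one item's (a, b) estimates from the 'from' scale on the 'to' scale."""
    return a / A, A * b + B


# Hypothetical anchor-item estimates; the 'to' scale equals 1.25 * 'from' - 0.5.
b_from = [-1.0, 0.0, 0.5, 1.5]
a_from = [0.8, 1.0, 1.2, 1.5]
b_to = [1.25 * b - 0.5 for b in b_from]
a_to = [a / 1.25 for a in a_from]

A_ms, B_ms = mean_sigma(b_from, b_to)
A_mm, B_mm = mean_mean(a_from, a_to, b_from, b_to)
print(A_ms, B_ms, A_mm, B_mm)
```

With real estimates the two methods generally give somewhat different constants, because the anchor-item parameters are estimated with error in both calibrations.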

Unidimensional IRT models express the probability of a response $z_{ni}$ of a given person, $n$ 
($n = 1, \ldots, N$), to a given item, $i$ ($i = 1, \ldots, I$), as a function of the person’s competency or ability 
(which is a latent variable), denoted $\theta_n$, and a possibly vector-valued item parameter, $\beta_i$; that is, 

$$P_{ni} = P(X = z_{ni}) = f(z_{ni}, \theta_n, \beta_i).$$

In the case of the well-known three-parameter logistic (3PL) model (Lord & 
Novick, 1968) that is used to fit data from dichotomous items, the item parameter vector is 
three-dimensional. Its dimensions are the slope or discrimination that is usually denoted by $a$, the 
difficulty $b$, and the guessing parameter $c$, respectively; that is, $\beta_i' = (a_i, b_i, c_i)$. However, most 
results presented here do not depend on the specific choice of the model and apply to models for 
both dichotomous and polytomous data. 
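
For readers who prefer a computational statement, the 3PL response function can be sketched as follows. This is a minimal illustration in its common logistic form; the function name is hypothetical, and the scaling constant that some formulations place in the exponent is assumed to be 1 here.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: person ability; a: discrimination; b: difficulty;
    c: guessing parameter (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item with a = 1.0, b = 0.0, c = 0.2, evaluated at three abilities.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```

At $\theta = b$ the probability lies halfway between the guessing parameter and 1, and for very low abilities it approaches the guessing parameter rather than 0.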

Table 3 shows that in the NEAT design, X is not observed in population Q, and Y is not 
observed in population P. To overcome this feature of the NEAT design, all linking methods 
developed for the NEAT design (both observed-score and IRT methods) must make additional 
assumptions of a type that does not arise in the other linking designs. IRT-based linking makes 
the following assumptions: 

Assumptions. The tests to be equated, X and Y, and the anchor, V, are all unidimensional 
(i.e., all items measure the same unidimensional construct), carefully constructed tests, in which 
the local independence assumption holds (Hambleton et al., 1991); the chosen IRT model fits the 
data well.

When conducting scale linking in the NEAT design, the parameters of the model from
different test forms need to be placed on the same IRT scale. The methods available for the 
scaling or calibration purposes are mentioned below. Operationally, the necessary item 
calibrations are usually obtained using the marginal maximum likelihood (MML) estimation 
method for IRT models (Bock & Aitkin, 1981). In the past, the joint maximum likelihood (JML) 
method has been used; conditional maximum likelihood (CML) methods are used for
one-parameter logistic IRT models (1PL, Rasch model) or for 1PL models with fixed slopes such as
the OPLM (Verhelst & Glas, 1995), and in some recent developments, empirical Bayesian
estimation methods are used in the context of Markov chain Monte Carlo (MCMC) estimation
(see Patz & Junker, 1999). Note that the approaches described below can be employed using any 
of the available statistical estimation methods: marginal maximum likelihood (which is 
implemented in, for example, the software package PARSCALE, Muraki & Bock, 1997), joint 
maximum likelihood (for example, the method used by the software package LOGIST, 
Wingersky, Barton, & Lord, 1982), and Bayesian inference with MCMC (which can be 
implemented using, for example, the software package BUGS, Spiegelhalter, Thomas, Best, & 
Gilks, 1995) methods. 

Joint item calibration with no linking or separate calibration. Separate calibration in a
NEAT design can be obtained in two ways: (a) by estimating jointly all the parameters given all 
the data but without restrictions on the parameters of the common items and treating the items 
that an examinee did not take as not administered or (b) by carrying out the estimations 
separately: The item and ability distribution parameters for population P are estimated given 
data, X and V in P, separately from the item and ability distribution parameters for population Q 
given data, Y and V in Q. The two methods provide similar results and accomplish no linking. 
The usual practice is to use the second approach and to do the calibration separately on the two 
populations. However, the first approach provides the opportunity for testing the hypothesis that 
the common items behave as common items as intended (see von Davier & von Davier, 2004, in 
press; or Glas, 1999). 
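
The data layout used when all responses from a NEAT design are analyzed jointly, with items an examinee did not take coded as not administered, can be sketched as follows. This is an illustration only: the function `stack_neat`, the column order (X, then V, then Y), and the use of `None` as the not-administered code are assumptions made here; calibration software uses its own missing-data codes.

```python
def stack_neat(rows_p, rows_q, n_x, n_v, n_y):
    """Stack responses from P (items X and V) and Q (items V and Y).

    rows_p: lists of length n_x + n_v (X responses, then V responses)
    rows_q: lists of length n_v + n_y (V responses, then Y responses)
    Returns rows of length n_x + n_v + n_y in which None marks items
    that were not administered (missing by design).
    """
    stacked = []
    for row in rows_p:
        if len(row) != n_x + n_v:
            raise ValueError("unexpected row length in population P")
        stacked.append(list(row) + [None] * n_y)
    for row in rows_q:
        if len(row) != n_v + n_y:
            raise ValueError("unexpected row length in population Q")
        stacked.append([None] * n_x + list(row))
    return stacked


# Two X items, one anchor item, two Y items; one examinee per population.
data = stack_neat(rows_p=[[1, 0, 1]], rows_q=[[0, 1, 1]], n_x=2, n_v=1, n_y=2)
print(data)
```

In this layout the anchor responses of both populations share the same columns. Whether the anchor item parameters are then constrained to be equal across populations is what distinguishes the approaches discussed in this section.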

Concurrent calibration. As an alternative, the item parameters from X, V (in both 
populations), and Y can be estimated jointly, assuming that the items in V are the same for both 
populations and coding the items that an examinee did not take as not administered, since these 
outcomes were unobserved and are missing by design.

Fixed parameters scale linking. In this method, common items whose parameters are
known (for example, from a previous administration calibration or a separate calibration) are
anchored or fixed to their known estimates during calibration of other forms. By treating the
common item parameters as known and, therefore, not reestimating them, the remaining item
parameters from the noncommon items in the two test forms are forced onto the same scale
as the items with fixed parameters. This calibration procedure is more restrictive than concurrent
calibration, and it is not appropriate for cases where the populations who take the two test forms
differ significantly in ability.

Mean-mean and mean-var IRT scale linkings. If an IRT model fits the data, any linear 
transformation (with slope A and intercept B) of the theta scale also fits these data, provided that 
the item parameters are also transformed in the same way (see, for example, Kolen & Brennan, 
2004, chapter 6). In the NEAT design, the most straightforward way to transform scales when 
the parameters were estimated separately is to use the means and standard deviations of the item 
parameter estimates of the common items for computing the slope and the intercept of the linear 
transformation. Loyd and Hoover (1980) described the mean-mean method, where the mean of 
the a-parameter/slope estimates for the common items is used to estimate the slope of the linear 
transformation. The mean of the b-parameter/difficulty estimates of the common items is then 
used to estimate the intercept of the linear transformation (see Kolen & Brennan, 2004). The mean-var
IRT scale linkage (Marco, 1977), referred to earlier in this section as the mean-sigma method, can be implemented in the same way, with only a
slight difference in the restrictions used. The means and the standard deviations of the b- 
parameters are used to estimate the slope and the intercept of the linear transformation. Both 
methods are seldom used in practice and tend to be inadequate if the groups taking the test forms 
differ in abilities. 
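
The two sets of linking constants described above can be computed in a few lines. The sketch below assumes the usual convention that the ability scale is transformed as $\theta^* = A\theta + B$, so that slopes transform as $a^* = a/A$ and difficulties as $b^* = Ab + B$; the function and variable names are illustrative, and the item parameter values are made up for the example.

```python
from statistics import mean, pstdev


def mean_mean(a_from, b_from, a_to, b_to):
    """Slope A from the mean slopes, intercept B from the mean difficulties."""
    A = mean(a_from) / mean(a_to)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def mean_sigma(b_from, b_to):
    """Slope A from the SDs of the difficulties, intercept B from their means."""
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def transform_items(a, b, A, B):
    """Move slopes and difficulties onto the target scale."""
    return [ai / A for ai in a], [A * bi + B for bi in b]


# Hypothetical common-item estimates on the scale to be transformed ...
a_q = [0.8, 1.0, 1.2]
b_q = [-1.0, 0.0, 1.5]
# ... and the same items on the target scale, built here with A = 1.25, B = -0.5.
a_p = [a / 1.25 for a in a_q]
b_p = [1.25 * b - 0.5 for b in b_q]

print(mean_mean(a_q, b_q, a_p, b_p))
print(mean_sigma(b_q, b_p))
```

With error-free parameters both methods recover the same constants. With estimated parameters they generally differ, because the two methods use different summary statistics of the common-item estimates.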

Stocking and Lord scale linking. Characteristic curves transformation methods were 
proposed (Haebara, 1980; Stocking & Lord, 1983) to avoid some issues related to the mean- 
mean and mean-var approaches, such as the fact that various combinations of the item parameter 
estimates produce almost identical item characteristic curves over the range of ability at which 
most examinees score. 

The Stocking and Lord IRT scale linking finds the slope and intercept of the linear transformation of
the anchor item parameters estimated in one population (say, Q) such that the resulting test
characteristic function of the anchor matches that in the reference population (say, P).

Haebara scale linking. Haebara (1980) expressed the differences between the
characteristic curves as the sum of the squared differences between the item characteristic
functions for each item over the common items for examinees of a particular ability $\theta_n$. The
Haebara method is more restrictive than the Stocking and Lord method because the restrictions 
take place at the item level (i.e., for each item from the set of common items), while the Stocking 
and Lord approach poses a global restriction at the anchor test level. 
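
The difference between the two characteristic curve criteria can be made explicit in code. The sketch below is a simplified, one-directional version evaluated on a fixed grid of ability values; it assumes the 3PL model without the constant $D$ and the transformation $a^* = a/A$, $b^* = Ab + B$. Operational implementations typically weight the ability points by an ability distribution, and the names `stocking_lord_loss` and `haebara_loss` are illustrative.

```python
import math


def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def rescale(items, A, B):
    """Move (a, b, c) triples onto the target scale: a/A, A*b + B, c."""
    return [(a / A, A * b + B, c) for a, b, c in items]


def stocking_lord_loss(items_ref, items_new, A, B, thetas):
    """Squared difference between the two anchor TCCs, summed over thetas."""
    moved = rescale(items_new, A, B)
    loss = 0.0
    for t in thetas:
        tcc_ref = sum(p_3pl(t, *item) for item in items_ref)
        tcc_new = sum(p_3pl(t, *item) for item in moved)
        loss += (tcc_ref - tcc_new) ** 2
    return loss


def haebara_loss(items_ref, items_new, A, B, thetas):
    """Squared item-level ICC differences, summed over items and thetas."""
    moved = rescale(items_new, A, B)
    return sum((p_3pl(t, *r) - p_3pl(t, *m)) ** 2
               for t in thetas for r, m in zip(items_ref, moved))


# Hypothetical anchor parameters on the reference scale (population P) ...
anchor_p = [(1.0, -0.5, 0.2), (1.4, 0.3, 0.15), (0.8, 1.0, 0.25)]
# ... and the same items expressed on another scale, built with A = 1.2, B = 0.4.
anchor_q = [(a * 1.2, (b - 0.4) / 1.2, c) for a, b, c in anchor_p]
grid = [-3.0 + 0.5 * k for k in range(13)]

print(stocking_lord_loss(anchor_p, anchor_q, 1.2, 0.4, grid))  # essentially zero
print(haebara_loss(anchor_p, anchor_q, 1.0, 0.0, grid))        # clearly positive
```

In practice the pair (A, B) is found by passing either criterion to a numerical minimizer. The Stocking and Lord criterion squares the difference of sums, whereas the Haebara criterion sums the squared item-level differences, which is why the latter is the more restrictive of the two.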

von Davier and von Davier (2004) proposed a new perspective on IRT scale linking by 
viewing any linking function as a restriction function on the joint log-likelihood function based 
on all the data. Rewriting any linking as a restriction function and estimating the model 
parameters under this restriction implies a larger flexibility in the linking process—when dealing 
with vertical linking, for example. This new method can incorporate the modelling of growth, 
possibly expressed as a hierarchical structure on the item parameters in the anchor (discussed in 
more detail below). The approach presented in the paper by von Davier and von Davier (2004) 
may easily be extended to multidimensional IRT models, at least for simple structure multiscale 
IRT models (like the one used in NAEP and other large scale assessments). 

3.2 IRT Calibration in the Balanced Incomplete Block Design 

Operationally, matrix samples of item responses from balanced incomplete block designs 
are calibrated jointly using MML estimation of IRT models (Bock & Aitkin, 1981), since MML 
estimation does not require that all items be taken by all examinees. The number of items across 
all blocks in survey assessments reaches hundreds, and the number of examinees in nationally 
representative samples reaches hundreds of thousands. Due to computational constraints in the 
past, some assessment programs used only a subset of examinees to carry out item parameter
estimation. This reduction of the sample used for estimation is no longer necessary when using 
MML estimation with recent computer hardware. 

MML estimation of IRT models with a matrix design can be done both with Rasch type 
models (or 1PL model) and with two-parameter logistic (2PL) or three-parameter logistic (3PL) 
models, as well as mixed models for dichotomous and polytomous response data. The approach 
taken is a multistage estimation; Patz and Junker (1999) used the phrase divide and conquer for
this approach. The first step is estimating the structural parameters of the measurement model (item
parameters). The second step is estimating the structural parameters of a population model (often
a latent regression on grouping variables) that may include a potentially large number of
covariates. The third step is concerned with estimating distributions,
percentiles, and percentages above cut points for policy-relevant subpopulations.

It is important to note that IRT models include homogeneity assumptions that may not be 
met by default in complex samples as well as samples from composite populations. Careful item 
selection, and if necessary, treatments of a small number of misfitting items such as splitting 
items (i.e., different item parameters have to be assumed in different groups to account for
differential item functioning, DIF) or collapsing response categories are some of the tools used in 
quality control during item calibration. 

Even though the mechanics of item calibrations have become easy with up-to-date 
software and hardware, the calibration of item parameters for matrix samples of item responses 
requires special attention with respect to response rates in balanced incomplete block designs. 
Rates of omitted responses across blocks, context effects, and effects based on block order are 
some of the threats to IRT calibration that can only be touched on in this paper. Technical reports 
of major assessment programs such as NAEP, PISA, and TIMSS contain chapters on quality 
control of item calibrations. von Davier, Sinharay, Oranje, and Beaton (in press) gave an
overview of the current approach to estimating population characteristics in survey assessments. 
Obviously, some of the issues mentioned here represent challenges for all types of linking. 

Linking across assessment cycles using data from balanced incomplete block designs is a 
nontrivial endeavour for several reasons. First, assessment cycles usually are 2 to 9 years apart, 
so that the difficulty of items may change between cycles. Second, not all blocks from previously 
administered assessments can be reused, since some of the items are usually released together 
with the reports on results of preceding cycles. In addition, assessment frameworks change and 
may contain additional subdomains or may require new item formats. 

In the best case, only released blocks of items will be replaced with new blocks and a 
substantial percentage of blocks will be the same (and administered in the same form, same print, 
and same position) across assessment cycles. In that case, a joint calibration (that accounts for 
the potential nonequivalence of assessment cohorts by estimating separate ability distributions in 
a multiple group IRT model) of both the old and the new sample will accomplish the task, give
or take a few items that need to be separated between cycles in the case that item difficulties
were observed to be very different in the two cycles.


The reality of linking in large-scale assessments and the specifics of the multiple steps of analyses and
quality control are more complex than in other linking designs, especially if more than one
previous cycle contains linkage blocks and several subdomains have to be linked. In less than
ideal (but more realistic) cases, the assessment frameworks and the construct to be measured
change somewhat, so that it has to be established empirically whether link items still fit in the
new framework, in conjunction with the newly introduced items and subdomains. After the joint 
calibration has been carried out, the new (in practice, slightly different) scale has to be linked 
back to the original scale used for the previous assessments. 

3.3 Equating of Scores in the IRT Framework 

If equating of scores is desired, different programs and testing companies use different 
techniques after the calibration process. In some programs, an interim step is performed, where
additional methods, such as IRT true score or observed-score equating, are used to link the raw
scores on the two forms. After that, a third step is employed, which refers to placing the raw 
scores onto some reporting scale. In other programs, the second step is skipped altogether, and 
the ability scale is linked directly to the reporting scale (Yen, 1984) as briefly described below. 

Raw score-to-scale score scoring table. Yen’s method generates a raw score-to-scale
score scoring table. Placing item parameters from a theta scale on a scale score metric is done by
(a) defining appealing additive and multiplicative constants (e.g., 400, 40) at the beginning of a
testing program, and (b) in future years, placing items on the base $\theta$ metric via linking
(typically using the Stocking and Lord method) and then applying the (400, 40) transformation
(W. M. Yen, personal communication, December 14, 2005). Some testing programs do this
transformation in one step, by setting the reference parameters for the Stocking and Lord method 
to the scale score metric. 

Once item parameters are in the scale score metric, they can be used in item response 
pattern scoring or to generate a raw score-to-scale score scoring table, using a procedure like the 
one described in Yen (1984) for the 3PL model or using inverted test characteristic curves 
(TCCs). Additional methods such as true score equating are unnecessary. 

This method is used in scoring tests of state assessments and published tests, such as 
TerraNova (CTB/McGraw-Hill, 2000).

IRT true score equating. In this subsection, we briefly describe the IRT true score
equating that is used by certain testing companies. For example, this method is used at ETS for 
assessments such as the TOEFL® iBT (ETS, 2006b). We do not describe the IRT observed-score 
equating in this paper (a detailed description of it is given in, for example, Kolen & Brennan, 
2004). 

In a NEAT design equating process for tests with multiple-choice items, for example, the 
3PL IRT model is separately fitted to the data from the two populations P and Q in Table 3. Then, 
the characteristic curve method (Stocking & Lord, 1983) is used to place the separately estimated 
item parameters onto a common scale. After all the item and person parameters are on the same 
scale, they are used to estimate the true score equating function. The number-correct true score for 
a given $\theta$ is obtained by summing the (conditional) probabilities of a correct response over the items in the
test. Then the true score equating method is used to obtain equivalent scores on X and Y (see von
Davier & Wilson, 2005; Kolen & Brennan, 2004; Petersen, Kolen, & Hoover, 1989). According to
this method, for a given true score of the new test form X, one finds (via iterative procedures, such
as the Newton-Raphson method) the value of competency/ability $\theta$ and then computes the true
score on the old form Y.
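
The procedure just described can be sketched as follows, assuming that the 3PL item parameters of both forms are already on a common scale. The function names, the starting value, and the item parameters are illustrative assumptions; Newton-Raphson can fail to converge for true scores close to the ends of the admissible range, so operational code would add safeguards.

```python
import math


def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Number-correct true score: sum of the item response probabilities."""
    return sum(p_3pl(theta, *item) for item in items)


def tcc_slope(theta, items):
    """Derivative of the test characteristic curve with respect to theta."""
    total = 0.0
    for a, b, c in items:
        p = p_3pl(theta, a, b, c)
        total += a * (p - c) * (1.0 - p) / (1.0 - c)
    return total


def true_score_equate(tau_x, items_x, items_y, theta=0.0, tol=1e-10, max_iter=100):
    """Map a true score on form X to the equivalent true score on form Y."""
    lowest = sum(c for _, _, c in items_x)
    if not lowest < tau_x < len(items_x):
        raise ValueError("true score must lie between sum of c and number of items")
    for _ in range(max_iter):
        step = (tcc(theta, items_x) - tau_x) / tcc_slope(theta, items_x)
        theta -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    return tcc(theta, items_y)


# Hypothetical item parameters (a, b, c) for a new form X and an old form Y.
form_x = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.9, 0.8, 0.25), (1.5, 0.5, 0.15)]
form_y = [(1.1, -0.8, 0.2), (0.9, 0.2, 0.2), (1.3, 0.6, 0.2), (1.0, 1.0, 0.25)]
print(true_score_equate(2.5, form_x, form_y))
```

The range check reflects the fact that true scores at or below the sum of the guessing parameters have no corresponding ability value under the 3PL model.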

The IRT true-score equating requires that the tests are number-right scored, which 
involves an implicit assumption that there are no omits (see Kolen & Brennan, 2004). If the tests 
are formula-scored, then some sort of transformation is necessary; this transformation will treat 
the omits as wrong. The IRT true score equating introduces one more assumption: The 
relationship between the true scores holds also for the observed scores. This assumption has not 
been theoretically proved, but was confirmed in research studies (Lord & Wingersky, 1984). In 
IRT true score equating that uses the common 3PL IRT model, the lowest possible true score is
the sum of the $c_i$, the so-called guessing parameters, and not 0. In this case, there is some
arbitrariness of the results of the conversion of the observed scores that are outside the range of
possible true scores on the new form X (see Kolen & Brennan, 2004, for details). Finally, after the
raw-to-raw conversion has been obtained, the equated scores are placed on the reporting scale.

3.4 Observed-Score Equating Methods

In contrast to IRT linking methods discussed above, the observed-score linking methods 
relate person measures of proficiency from two different tests, in terms of the total scores,
without specifying an explicit model for the person-by-item interaction. In this subsection, we
briefly describe observed-score equating methods, both the traditional procedures and a newly 
proposed equating procedure. 

The definition of score interchangeability for the observed-scores linking and equating 
methods is based on relationships between the moments (which leads to linear equating) or 
percentiles of the two score distributions (which leads to equipercentile equating) to be equated. 

In addition to the two test forms to be equated, X and Y, we also have an explicit target 
population, T, on which the equating is to be done. In an EG design, the tests are given to two 
samples that are (assumed to be) randomly drawn from a population of examinees, P (in this 
case, we assume that T = P; see Livingston, 2004, for a slightly different view and definition of
a target population). The target population, T, for the NEAT design is assumed to be a weighted
average of P and Q. P and Q are given weights that sum to 1; the population defined in (1) 
below is also called the synthetic population in the literature, to emphasize that units of it are not 
real test takers. This is denoted by 

T = wP + (1 - w)Q. (1) 

The partition of T is determined by the weight w (see Angoff, 1971; von Davier et al., 2004; or
Kolen & Brennan, 2004, for a discussion of the target population in the NEAT design and of the 
role of the weights). 

Many observed-score equating methods are based on the equipercentile equating 
function. It is defined on the target population, T, as: 

$e_{Y;T}(x) = G_T^{-1}(F_T(x))$, (2)

where $F_T(x)$ and $G_T(y)$ are the cumulative distribution functions (cdfs) of X and Y, respectively,
on T. Given that X and Y are discrete random variables, their cdfs are step functions. However, in
order for this definition to make sense and to ensure that the inverse equating function exists, we
also assume that $F_T(x)$ and $G_T(y)$ have been made continuous or continuized so that the inverse
functions exist for $F_T(x)$ and $G_T(y)$.

Several important classes of observed-score equating methods may be viewed as only
differing in the way that the continuization of $F_T(x)$ and $G_T(y)$ is achieved. The traditional
equipercentile equating method (also called the percentile rank method) uses linear interpolation
of the discrete distribution to make it piecewise linear and therefore, continuous. The kernel
equating (KE; von Davier et al., 2004; Holland & Thayer, 1989) method uses a Gaussian kernel
smoothing to approximate the discrete distribution by a continuous density function.
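
The traditional continuization by linear interpolation can be sketched as follows. The sketch assumes integer scores 0, ..., K whose probability is spread uniformly over the interval from half a point below to half a point above each score, which is how the percentile rank method is usually described; the function names are illustrative, and the score probabilities are assumed to be given on the target population T.

```python
import math


def cont_cdf(x, probs):
    """Piecewise-linear cdf for integer scores 0..K with probabilities probs."""
    if x <= -0.5:
        return 0.0
    if x >= len(probs) - 0.5:
        return 1.0
    j = int(math.floor(x + 0.5))  # integer score whose interval contains x
    return sum(probs[:j]) + (x - (j - 0.5)) * probs[j]


def cont_quantile(p, probs):
    """Inverse of cont_cdf; scores with zero probability are skipped."""
    cum = 0.0
    for j, r in enumerate(probs):
        if r > 0 and p <= cum + r:
            return (j - 0.5) + (p - cum) / r
        cum += r
    return len(probs) - 0.5


def equipercentile(x, probs_x, probs_y):
    """Equate score x on X to the Y scale via the continuized cdfs."""
    return cont_quantile(cont_cdf(x, probs_x), probs_y)


# Hypothetical score probabilities; Y is shifted one point relative to X.
probs_x = [0.1, 0.2, 0.4, 0.2, 0.1, 0.0]
probs_y = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1]
print(equipercentile(2, probs_x, probs_y))  # approximately 3
```

Replacing `cont_cdf` and `cont_quantile` by another continuization, such as Gaussian kernel smoothing, changes the method while leaving the composition in `equipercentile` unchanged.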

The equipercentile equating function leads to linear equating if $F_T(x)$ and $G_T(y)$ have
the same shape while differing only in mean and variance. The linear equating function,
$\mathrm{Lin}_{Y;T}(x)$, is defined by

$\mathrm{Lin}_{Y;T}(x) = \mu_{YT} + \sigma_{YT}\,((x - \mu_{XT})/\sigma_{XT})$. (3)

In Theorem 1.1 of von Davier et al. (2004), it is shown that any equipercentile equating function can 
be decomposed into the corresponding linear equating function and a nonlinear part. 
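
The linear equating function in (3) needs only the means and standard deviations of X and Y on T. A minimal sketch, assuming that the score probabilities on the target population are available (the helper `moments` and all numerical values are illustrative):

```python
import math


def moments(scores, probs):
    """Mean and standard deviation of a discrete score distribution."""
    mu = sum(r * s for s, r in zip(scores, probs))
    var = sum(r * (s - mu) ** 2 for s, r in zip(scores, probs))
    return mu, math.sqrt(var)


def linear_equate(x, mu_x, sd_x, mu_y, sd_y):
    """Equation (3): match standardized deviation scores on X and Y."""
    return mu_y + sd_y * (x - mu_x) / sd_x


mu_x, sd_x = moments([0, 1, 2, 3, 4], [0.1, 0.2, 0.4, 0.2, 0.1])
mu_y, sd_y = moments([0, 1, 2, 3, 4], [0.2, 0.3, 0.3, 0.1, 0.1])
print(linear_equate(3, mu_x, sd_x, mu_y, sd_y))
```

A score that lies a given number of standard deviations above the mean of X is mapped to the score that lies the same number of standard deviations above the mean of Y.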

The observed-score equating functions for the NEAT design, equipercentile and linear, 
also make assumptions in order to overcome the missing by design data, a feature of the NEAT 
design. The assumptions and the formulas for the classical linear and equipercentile equating are 
given in Kolen and Brennan (2004) and von Davier et al. (2004). 

von Davier et al. (2004) viewed any observed-score test equating as having five steps or 
parts, each of which involves distinct ideas. They are: (a) presmoothing of the score 
distributions, (b) estimation of the score probabilities on the target population, (c) continuization 
of the discrete fitted score distributions, (d) computing the equating function, and (e) computing 
the standard error of equating and related accuracy measures. In some assessment programs, 
where the samples are very large, the practitioner might decide to skip the presmoothing step; 
however, von Davier et al. recommended the presmoothing even for these circumstances. 

The kernel method of test equating. The KE method (von Davier et al., 2004; Holland & 
Thayer, 1989) is an observed-score equating procedure that promises to unify several observed- 
score methods of test equating into a single method while, at the same time, providing new 
statistical information that can be used in the practice of test equating. 

KE brings together these five steps into an organized whole rather than treating them as 
disparate problems. KE exploits presmoothing by fitting loglinear models to score data and 
incorporates the error introduced by the smoothing procedure into step (e) above. KE provides 
new tools for comparing two or more equating functions and for rationally choosing between 
them based on newly introduced indices.

Kernel equating is an equipercentile equating procedure in which the score distributions
to be equated are converted from discrete distributions to continuous distributions by using a
normal (Gaussian) kernel as opposed to using linear interpolation as in the traditional
equipercentile equating method. By varying the bandwidth values of the Gaussian kernel (see
von Davier et al., 2004), KE can approximate the traditional equipercentile and linear equating
methods. When optimal bandwidths are chosen, the KE will approximate the traditional
equipercentile equating method. The process of choosing the optimal bandwidth is fully
automatic and involves the minimization of a penalty function. When the bandwidths used are 10
times the standard deviation of the scores or larger (i.e., large bandwidths), the continuized
distributions will be nearly normal, in which case the KE functions can be regarded as
approximately linear. Thus, linear equating can be regarded as a special case of equipercentile
equating in the framework of KE. The KE framework also introduces the percent relative error
(PRE) that aids the diagnosis of the equating function and introduces the standard error of the
difference between two equating functions (the SEED). The SEED can help rationalize the
linear/nonlinear decision.
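
The effect of the bandwidth can be illustrated with a short sketch of the Gaussian kernel continuization. It assumes the form of the continuized cdf given in von Davier et al. (2004), $F_h(x) = \sum_j r_j \Phi\big((x - a x_j - (1 - a)\mu)/(a h)\big)$ with $a = \sqrt{\sigma^2/(\sigma^2 + h^2)}$, which preserves the mean and variance of the discrete distribution. The function name and the score probabilities are illustrative, and the automatic bandwidth selection is not shown.

```python
import math
from statistics import NormalDist


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuized cdf of a discrete score distribution."""
    mu = sum(r * s for s, r in zip(scores, probs))
    var = sum(r * (s - mu) ** 2 for s, r in zip(scores, probs))
    a = math.sqrt(var / (var + h * h))  # shrinkage that keeps the variance fixed
    phi = NormalDist().cdf
    return sum(r * phi((x - a * s - (1.0 - a) * mu) / (a * h))
               for s, r in zip(scores, probs))


scores = [0, 1, 2, 3, 4]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
sd = math.sqrt(1.2)  # standard deviation of this score distribution

# Small bandwidth: the continuized cdf stays close to the discrete step function.
print(kernel_cdf(2.5, scores, probs, h=0.05))
# Large bandwidth: the continuized cdf is close to a normal cdf.
print(kernel_cdf(3.0, scores, probs, h=10 * sd), NormalDist(2.0, sd).cdf(3.0))
```

With the large bandwidth both continuized distributions are nearly normal, so composing one cdf with the inverse of the other yields an approximately linear equating function.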

As in the IRT true-score equating, several steps are employed in order to equate and 
report scores: The observed-score equating methods are used to link the observed scores on the 
two forms. After that a second step is employed, which refers to placing the raw scores onto 
some reporting scale. 

The observed-score equating methods are mostly used in horizontal equating for tests 
where every test taker has all the items. For tests built using short blocks of items (as in the case 
of a BIB design) where each test taker has some of the blocks, and in most vertical linking 
settings, the IRT methods are preferable for linking purposes. 

3.5 Vertical Scaling 

As in the equating process, the first step in vertical linking is to place the competencies or 
abilities from several test forms collected from designs of the type described in Tables 4, 5, and 6
onto a common scale and to rank order the examinees on an interim-score scale. Then the scores 
from each level are linked to a reporting scale, such as a grade-equivalent scale. This procedure 
is called scaling. The scaling procedure can use the observed scores or can be IRT-based. 

The most commonly used linking methods for creating scale scores in vertical scale 
settings are the Hieronymus, Thurstone, and IRT scaling procedures (see Hendrickson, Kolen, &
Tong, 2004; Tong & Harris, 2004; Yen, 1986; Yen & Burket, 1997). In all these methodologies,
an interim scale is chosen (for example, in Table 6, the data from a representative sample on the
scaling test V is used for a target scale; in a design such as in Table 5, the tests given to $P_3$ could
be the target scale).

Hieronymus scaling. This method (Petersen et al., 1989) was developed for data
collection designs with a scaling test, as shown in Table 6; the method makes use of the total
number-correct score for dichotomously scored tests or the total number of points for 
polytomously scored items. The scaling test is constructed to be representative of content from 
the lowest to the highest level of testing, and it is administered to a representative sample from 
each testing level or grade. The true-score distributions of competencies at each level are 
assumed to have the same mean as the observed distributions but a particular variance, as 
expected following classical test theory; it is assumed that the scaling test is representative of the 
domain/construct to be measured at each level. To conduct Hieronymus scaling, the median 
number-correct score on the scaling test V from Table 6 for each grade level is assigned a 
prespecified score scale value (these values are based on various considerations that include the 
domain to be measured, the measurement point in time in relation to the domain, etc.—see Kolen 
& Brennan, 2004, p. 382). Hence, the within- and between-level variability and growth are 
determined on an external scaling test, which is the special set of common items described in the 
design from Table 6; this design is usually paired with this scaling method.

Thurstone scaling. This process (Thurstone, 1925, 1938) first creates an interim score
scale as above and then normalizes the distributions of the variables at each grade. That is, it
assumes that scores on an underlying scale are normally distributed within each group of interest
and makes use of the total number-correct scores for dichotomously scored tests or the total
number of points of polytomously scored items to conduct scaling. These normalized
distributions are placed on the final score scale using the means and variances of these normal
distributions at each grade. In other words, Thurstone scaling normalizes the raw scores and
linearly equates, and it is usually conducted with equivalent groups.
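
The normalization step can be sketched as follows. This is a minimal illustration of normalizing raw scores within one group, not a full implementation of Thurstone scaling: it assumes integer scores 0, ..., K, uses mid-percentile ranks, and leaves out the subsequent linear placement of the groups on the final scale. The function name and the frequencies are illustrative.

```python
from statistics import NormalDist


def normalized_scores(freqs):
    """Map each raw score 0..K to a normal deviate via its mid-percentile rank.

    freqs[k] is the number of examinees in the group with raw score k.
    Scores whose percentile rank is exactly 0 or 1 get None.
    """
    n = sum(freqs)
    below = 0
    deviates = []
    for f in freqs:
        pr = (below + 0.5 * f) / n  # proportion below plus half of those at the score
        deviates.append(NormalDist().inv_cdf(pr) if 0.0 < pr < 1.0 else None)
        below += f
    return deviates


# Hypothetical raw-score frequencies for one grade.
print(normalized_scores([1, 2, 4, 2, 1]))
```

The resulting deviates can then be placed on the final score scale with a linear transformation based on the mean and standard deviation assigned to the group.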

IRT scaling. Model-based IRT scaling considers the person-by-item interactions. In theory,
at least, one could conceive of an IRT scaling for all existing IRT models, including
multidimensional IRT models or diagnostic models. However, in the practice of vertical scaling
of educational assessments, the models used are unidimensional models such as the Rasch and/or
partial credit models (PCM) or the 3PL and/or generalized partial credit (GPC) models. 

The methods for the item-parameter linking are those described before under IRT 
calibration (see also Harris & Hoover, 1987). The only major difference between the use of IRT 
in horizontal linking and the use in vertical linking is that if concurrent calibration is used for 
linking the item parameters onto the same scale for all the tests taken by several very different 
populations of test takers, the estimation method should allow for estimation of the parameters of 
multiple ability distributions. This estimation can be done separately by level and then linked 
(via characteristic curves, for example), or it can be done via a concurrent calibration with 
multiple populations. 


4. Measuring Growth in Longitudinal Settings 

There are many models available for measuring intra- and interindividual growth. The 
goal of this section is to review some of these models and to place them in a common 
background. We will first present the layout that is common to most measurement of change 
studies. Then we will mention some IRT extensions of the current methods for modelling 
change, which are the structural equation models (SEMs) and hierarchical or multilevel models 
(H/MLMs). In this section, we do not refer to survival analysis, and we make only a few 
references to causal inference analysis. 

This section relies on the sections above by assuming that (a) the same measurement 
instrument is used on all measurement occasions and this instrument continues to measure the 
same construct or that (b) a common scale has been established. 

Although the abundance of available methods is a good thing, measuring growth is still 
not an easy endeavour. In applied studies using models for measuring growth, one often finds 
considerable sparseness in the data across time points: There seems to be a general challenge in
obtaining samples that are large enough to support an accurate estimation of the (complex)
models, whether in terms of the number of measurement points in time, the data per measurement
occasion, or the data per subgroup of interest. Therefore, it appears that the inferential ambitions of the
models usually exceed the capacity of the data to support these inferences, and, at the same time, 
the consequences of these inferences might be serious in the political and social framework 
where these types of research and analyses are required.

This section focuses on defining the type of growth mentioned above, on reviewing the
assumptions required by (most) of the existing methodologies, and on reviewing several
approaches for measuring growth. Also, it will briefly discuss the differences between intra- and
interindividual measurements of growth. Hence, here we try to be explicit about what are the 
research questions behind the different types of measurement of growth models and the 
assumptions that underlie them, and we briefly review several approaches. 

Research questions. von Davier and von Davier (in press) identified four research
questions that motivate the study of change (regardless of the domain—psychology, education, 
or other fields): 

1. What is the change that each person experiences over time? (individual change) 

2. Do the rates at which each individual changes differ by values/outcomes of 
background variables? (interindividual systematic change) 

3. Does a specific treatment have an effect on how an individual changes? (causal 
inferences) 

4. How does a cohort change over time? 

Obviously, an answer to the first question on the individual trajectories over time is a 
prerequisite for answers to the subsequent two questions. In other words, modelling the change 
that an individual person experiences with time is at the core of the study of change. However, 
the answer to the fourth research question may or may not rely on the individual trajectories. 

Assumptions. When we talk about process assumptions, we refer to those assumptions 
necessary to answer the questions above. von Davier and von Davier (in press) identified three
types of process assumptions: (a) assumptions about the data, (b) assumptions about the
instrument/outcome variable, and (c) assumptions about the model(s).

Assumptions about the data. The data should have appropriate features depending on 
which research question a study aims to answer. To answer research questions such as the first 
and second above, ideally data should be available longitudinally on many individuals (when 
time points and individuals have been sampled representatively) for at least three time points. 
Ideally, data should be balanced (although the H/MLM type of approaches can relax this 
requirement). If one wants to make causal inferences (the third research question above), then the
similarity between units/examinees across treatment and control groups should be ensured
(Holland, 2005; Raudenbush, 2004). 

To answer questions such as the fourth one above, the data do not necessarily need to 
be longitudinal. A design where random samples are independently drawn with replacement 
from a cohort at different time points might suffice. 
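
As a minimal sketch of why independent samples can suffice for the fourth question, the following Python fragment estimates the change in a cohort mean from two separately drawn samples, without any individual trajectories. The function name and the scores are hypothetical, and the standard error assumes simple random sampling, which an operational survey design would rarely satisfy exactly.

```python
import math
from statistics import mean, variance

def cohort_change(sample_t1, sample_t2):
    """Estimate the change in the cohort mean between two time points.

    The two samples are assumed to be drawn independently from the same
    cohort; no examinee needs to appear in both.
    """
    diff = mean(sample_t2) - mean(sample_t1)
    # Standard error of a difference between two independent sample means.
    se = math.sqrt(variance(sample_t1) / len(sample_t1)
                   + variance(sample_t2) / len(sample_t2))
    return diff, se

# Hypothetical scale scores from two independent samples of the cohort.
scores_t1 = [1.0, 2.0, 3.0]
scores_t2 = [2.0, 3.0, 4.0]
change, std_error = cohort_change(scores_t1, scores_t2)
```

The scores at both time points are assumed to be on a common scale already; otherwise the difference in means is not interpretable as change.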

Assumptions about the outcome variable. In SEM/HLM approaches the outcome variable 
must be a continuous variable at either the interval or the ratio level. In other approaches, 
however, the outcome variable does not need to be continuous. Categorical outcome 
variables can be handled by extensions of SEM/HLM models or by IRT models for change 
(Cronbach & Furby, 1970; Embretson, 1991; Meiser, 1996; Raudenbush, 2004; Willett & Sayer, 
1994; Wilson, 1989). 

The outcome variable must be comparable across time points (i.e., each scale point of the 
measure must retain an identical meaning over time). The outcome variable must remain construct- 
valid for the entire period of observation. These two assumptions speak to the necessity of 
establishing a common scale for the instrument(s), as discussed previously in this paper. 

Model assumptions. To answer the first research question, one needs a model for the 
individual change with change/growth parameters that aims to model the individual trajectories 
over time. 

Ideally, this model is based on a substantive theory. In most cases, the model has to be 
simple, since only a few time points are available: for example, a linear or a quadratic function of 
time with a stochastic measurement error (and additional technical assumptions on the 
distribution of error terms). Usually, plots of the observed patterns of change are used as a first, 
exploratory step. 
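
A linear model of individual change of the kind described above can be sketched as follows. This is an illustration only: it fits an ordinary least-squares line separately for each person, and the function name, the three time points, and the scores are hypothetical.

```python
from statistics import mean

def fit_linear_trajectory(times, scores):
    """Least-squares fit of score = intercept + slope * time for one person."""
    t_bar, y_bar = mean(times), mean(scores)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, scores))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return intercept, slope

# Hypothetical scale scores for two examinees at three time points.
times = [0, 1, 2]
scores_by_person = {"A": [10.0, 12.0, 14.0], "B": [8.0, 8.5, 9.0]}

# One (intercept, slope) pair per person: the individual growth parameters.
trajectories = {
    person: fit_linear_trajectory(times, scores)
    for person, scores in scores_by_person.items()
}
```

With only three time points, a straight line leaves a single degree of freedom per person for error, which is one reason why simple functional forms are usually the only feasible choice.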

If a study aims to answer the second (or third) type of research question(s) separately or 
simultaneously with the first question, an additional model is needed. A measurement of change 
study that tries to explain the systematic differences among the individual trajectories introduces 
a model for the interindividual change in the change/growth parameters from the first model. 

These second-order models are based on distributional assumptions about individual 
growth parameters and contain additional predictors of change. The second-order model might 
try to answer both types of questions related to interindividual differences in change (the second 
and third research questions). The difference between the models is in how the results are 
interpreted given the data at hand and given the additional requirement of similarity of 
units/subjects for causal inference, as mentioned above. 
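
One common way to write the two models together, given here as a sketch in conventional multilevel notation rather than the notation of any particular source, is a linear first-level model for person i at time t combined with a second-level model in which a background variable x_i predicts the individual growth parameters. All symbols below are assumptions introduced for illustration.

```latex
\begin{align}
y_{it}   &= \pi_{0i} + \pi_{1i}\, t + \varepsilon_{it},
         & \varepsilon_{it} &\sim N(0, \sigma^2), \\
\pi_{0i} &= \beta_{00} + \beta_{01}\, x_i + u_{0i}, \\
\pi_{1i} &= \beta_{10} + \beta_{11}\, x_i + u_{1i},
         & (u_{0i}, u_{1i})^{\top} &\sim N(\mathbf{0}, \mathbf{T}).
\end{align}
```

Here the coefficient \beta_{11} describes how the rate of change varies with x_i. Whether it can be read as a treatment effect depends on the design, not on the equations.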

A model that aims to answer only a research question of type four (i.e., to measure 
changes of population distributions over time) may or may not rely at all on a model of 
individual trajectories. 

4.1 Measuring Change Using IRT 

The purpose of this section is to briefly describe the approaches used for measuring inter- 
and intraindividual growth using IRT models. This section relies heavily on the literature survey 
described in von Davier and von Davier (in press). 

Measuring individual growth. In this category, we place IRT models that incorporate 
measuring change (and the IRT calibration/linking if the test forms differ) in one step. Glück and 
Spiel (1997) and Meiser, Stern, and Langeheine (1998) provided comprehensive theoretical 
descriptions of several relevant IRT models as well as detailed comparisons of the results of the 
models’ applications to the same data. These papers show how to apply different IRT models to 
data with repeated observations on the same items and the same individuals at different 
occasions. Both papers show how one could model change using the 1PL IRT model (Rasch 
model) and latent class models: the mover-stayer mixed Rasch model (Glück & Spiel, 1997), the linear 
logistic test model (LLTM) for measuring change (generalized LLTM; Fischer & Ponocny, 1994, 
1995), and the mixture distribution Rasch model (von Davier & Rost, 1995; Rost, 1990). In the 
examples provided in these papers, the samples used to illustrate the methods were small, 
especially in the Glück and Spiel (1997) paper. In these studies, the outcome variable is discrete. 
The software that was used in the analyses reported in these papers was LPMC (Fischer & 
Ponocny, 1994) and WinMira (von Davier, 2001). Other IRT-based approaches that could be 
placed in the same category are the multidimensional Rasch model (Kelderman, 1996) and the 
Saltus model (Wilson, 1989). 
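
To illustrate the general idea of building change into a Rasch-type model, the sketch below adds an occasion effect to the person parameter of the 1PL response function. The `change` parameter and the numerical values are hypothetical, and this simplified form is not the exact parameterization of any of the models cited above.

```python
import math

def rasch_probability(theta, beta, change=0.0):
    """Probability of a correct response under a Rasch (1PL) model.

    theta  : person ability at the first occasion
    beta   : item difficulty, assumed invariant across occasions
    change : hypothetical occasion effect added to the ability
    """
    return 1.0 / (1.0 + math.exp(-(theta + change - beta)))

# Same person and same item at two occasions; only the change effect differs.
p_time1 = rasch_probability(theta=0.0, beta=0.5)
p_time2 = rasch_probability(theta=0.0, beta=0.5, change=0.8)
```

Because the item difficulty is held fixed, any difference between the two probabilities is attributed to change in the person, which is exactly the invariance assumption that repeated administration of the same items relies on.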

Hierarchical models for vertical scaling. In this category, we mention two approaches 
described by Patz et al. (2003) and von Davier and von Davier (2004) that try to address vertical 
linking and a more complex modelling of growth in a one-step procedure. In addition to 
establishing the scale, the paper by Patz et al. proposed a method for modelling growth across 
years. The scaling and the growth modelling are both cast in a full Bayesian framework in a 
hierarchical model. The hierarchical multigroup IRT approach proposed by Patz et al. allows the 
explicit estimation of the functional form (assumed to be quadratic) of the grade-to-grade growth 
patterns. The parameters of this complex model are subsequently estimated using an MCMC 
approach. The paper also investigates a multidimensional, multigroup IRT model that captures 
differences in dimensionality and scale definition across grade levels. The hierarchical approach 
proposed by Patz et al. is “a more general version of concurrent estimation of the unidimensional 
IRT model” (p. 40), and their motivation was to unify the two most commonly used linking 
methods for vertical equating: concurrent calibration and separate calibration 
followed by test characteristic curve linking. 

Hierarchical models with IRT measurement models. This category includes models that 
try to explain interindividual differences. If the measurement instruments differ, then these 
approaches assume that the scores have been placed onto a common scale (see Raudenbush, 
2001, or Willett & Sayer, 1994). The models discussed here represent extensions of the SEM or 
HLM approaches. Both SEMs and HLMs can borrow from statistical methods that allow for 
non-normal outcome variables. The link functions used in generalized linear models (GLMs) and the 
item response functions used in IRT are obvious candidates for such an adoption of methods. 

The Rasch model is the model of choice in many applications of these merged models 
because it can be viewed either as a logistic regression model or as a loglinear model. 
These alternative views facilitate the use of the Rasch model in existing developments of SEM and 
HLM that are based on these alternative formulations. Kamata (2001) presented a model that 
represents an integration of the hierarchical linear model and the Rasch model. Raudenbush and 
Sampson (1999) presented an interesting application of such a model and showed how to 
estimate indirect effects of a three-level hierarchical model. The authors introduced a Rasch type 
measurement model (the model for the latent variables) at the first level. The next two levels are 
the usual ones for the (hierarchical structure of the) data. More specifically, the levels used in 
Raudenbush and Sampson (1999), Raudenbush, Johnson, and Sampson (2003), and Johnson and 
Raudenbush (in press) are the following: The first level is a logistic regression with mixed 
effects, in which the random effects are subjects and groups of subjects (blocks) and the fixed 
effects are items; the second level has fixed effects for the time interval of observation, additional 
true values of the two latent variables, and an error term; and the third level has the grand mean 
and an error (variance, covariance) term. 
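
The logistic-regression view of the Rasch model referred to above can be stated compactly. In the sketch below, the notation is assumed for illustration: X_{vi} is the binary response of person v to item i, \theta_v is the person parameter (treated as a random effect in the hierarchical formulation), and \beta_i is the item difficulty (treated as a fixed effect).

```latex
\[
\log \frac{P(X_{vi} = 1)}{1 - P(X_{vi} = 1)} = \theta_v - \beta_i
\]
```

Written this way, the first level of the hierarchical model is a logistic regression on item indicators with a person-specific intercept, and the higher levels then model \theta_v.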
The features that make this approach appealing for analyses of data from educational and 
social sciences include the following: (a) the model incorporates three levels, with time being 
introduced on the second level; (b) the Rasch-type modelling of items on level one enables the use 
of binary observables instead of sum scores; and (c) the model can be estimated with penalized ML 
using the Laplace approximation described in Raudenbush, Yang, and Yosef (2000), which is 
available in the HLM software. 

One important aspect of these models is that an IRT type measurement model on level 
one needs to assume certain invariance properties. Again, one minimal assumption is that a 
common scale has been established. 

Problems may arise if the scale linkage across time points relies only on subsets of items 
or if link items are phased out over a limited number of time points and replaced by other items. In 
this case, the link can be assumed to be somewhat weaker, and it can be argued with some 
justification that what changed was not the subject but the subject matter (i.e., we may not 
be able to establish that the same latent variable was measured over time). 

This is much less of a problem in short-term studies (where the time frame is days or weeks 
rather than school years) like the one described above, but it is a serious problem in educational 
assessments and studies in developmental psychology, where some items actually become 
obsolete as children grow older or go through certain educational levels. 

5. Discussion 

This paper reviews the existing methodologies for equating and linking tests that measure 
the same construct over time. The first part of the paper describes horizontal equating, where 
interchangeability of the scores is desired. Then, vertical scaling is discussed as it is used in 
educational assessments, where measuring growth in a particular domain and comparability of 
scores on test forms that measure the same construct but differ in difficulty is desired. We also 
address the challenges of covering large content domains in educational survey assessments over 
many cycles and discuss some of the solutions that have been developed using extensions of IRT 
models for reporting subgroup distributions. In the previous section, we discussed some existing 
explanatory models for inter- and intraindividual growth. Each of these areas is a large field in 
itself, and each potentially has a strong impact on educational policies, on the life of students 
and parents, or on the life of professionals. 
Nowadays, when more and more standardized testing is used nationally and 
internationally, we are also discovering more challenges in ensuring that the process and the 
results are fair and accurate. For example, among the challenges and research opportunities for 
test linking, we can easily mention the definition of growth, the construction of the anchor sets, 
the choice of the reporting scale, and the characteristics of the samples used for establishing the 
scale, which ideally should be representative of the population of test takers. In addition, 
psychometricians worry about the maintenance of the scale: how to introduce new forms, how to 
monitor the scale over time, and how to adjust to changes in the administration mode. 

In conclusion, we have noticed that many researchers and practitioners already work 
together in addressing these challenges, and we hope that many universities will consider 
implementing training curricula that prepare the future generation of psychometricians for the 
challenges in the field of education in the 21st century. 
Notes 


1 This claim is appropriate if models such as the Rasch model (1PL) and (to some extent) the 
one-parameter logistic model (OPLM; Verhelst & Glas, 1995) fit the data, so that conditional 
maximum likelihood can be used for estimating the item parameters. Strictly speaking, more 
complex IRT models that require joint estimation or a distributional assumption for the latent 
ability variable when estimating item parameters do not share the feature of parameter 
separability. 